-
Notifications
You must be signed in to change notification settings - Fork 39
feat: Make all collectionClient.list methods return value also be asyncIterator of relevant data
#790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
604c018 to
de2c43f
Compare
AsyncIterator to the apifyClient.listAsyncIterator to the StoreCollectionClient.list return value
AsyncIterator to the StoreCollectionClient.list return valueasyncIterator to the StoreCollectionClient.list return value
| export interface PaginationOptions { | ||
| /** Position of the first returned entry. */ | ||
| offset?: number; | ||
| /** Maximum number of entries requested. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /** Maximum number of entries requested. */ | |
| /** Maximum number of entries requested for one chunk. */ |
not sure if page or chunk is better, but it should be clear this is a limit for the chunk and not a total limit for the async iterator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually the total limit for the whole iterator. Chunk size is limited by the length of the platform response; it is not limited by this code.
| return { | ||
| ...currentPage, | ||
| async *[Symbol.asyncIterator]() { | ||
| yield currentPage; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we are yielding the whole pages? i thought we would yield just the items, one by one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, my idea was that the iterator should yield exactly the same type as the original response, so that you can use the same code to process it. We also already have the whole chunk in memory. For example:
actors = await apifyClient.store().list({ limit, offset });
processActors(actors)
for await (const actorsChunk of actors) {
processActors(actorsChunk)
}
But I guess you would prefer?
actors = await apifyClient.store().list({ limit, offset });
processActors(actors)
for await (const singleActor of actors) {
processSingleActor(singleActor)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we return the full pages, it will make the usage harder (forcing users to use nested loops). I would yield items one by one, not pages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to allow iteration over individual items, while making API requests only for as big chunks as possible.
So for example when using limit 3000, it will make 3 requests (3x1000), but for await loop will run 3000x times.
|
I agree w/ both your remarks about the API - can we still brainstorm the expected interface? I would love something like for await (const actor of await client.store().list()) {
log(actor.name); // logs all the Actor's names
}and for await (const actor of await client.store().list({ limit: 10 })) {
log(actor.name); // logs only first 10 Actor's names
}Alternatively, we could even attempt something like - for await (const actor of await client.store().list()) {
+ for await (const actor of client.store().list()) {if we can attach the async iterator to the promise (I suppose we can, but I'm not sure how happy TypeScript would be about this). (Internally, all those solutions would still lazy-load some optimal-sized pages with the correct offsets.) |
Yes, this is exactly what I would like to see, but I am not entirely sure if it's doable. The |
type IterablePromise<TItem> = Promise<Iterable<TItem>> & AsyncIterable<TItem>;
async function fetchData() {
await new Promise(res => setTimeout(res, 1000));
return ['data1', 'data2', 'data3'];
}
function list(): IterablePromise<string> {
const itemsPromise = fetchData();
async function* asyncGenerator() {
const items = await itemsPromise;
for (const item of items) {
yield item;
}
}
Object.defineProperty(itemsPromise, Symbol.asyncIterator, {
value: asyncGenerator
});
return itemsPromise as any;
}
async function main() {
// treat the return value as Promise<string[]>
for (const item of await list()) {
console.log(item);
}
// treat the return value as AsyncIterator<string>
for await (const item of list()) {
console.log(item);
}
}
main();The only issue is that TypeScript requires the return value of an async function to be a |
Updated accordingly. |
| let itemsFetched = currentPage.items.length; | ||
| let currentLimit = options.limit !== undefined ? options.limit - itemsFetched : undefined; | ||
| let currentOffset = options.offset ?? 0 + itemsFetched; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a feeling this could be simplified using the pagination fields from the response.
How about e.g.
const isLastPage = Math.min(page.total, options.limit + options.offset) <= page.count + page.offsetThis works because:
The options.limit and options.offset options select an interval of the list between
If the current page includes the
Then we could IMO do
let page = await fetchPage({ limit: options.limit, offset: options.offset });
yield* page.items;
/// The variable names are rather for explanation, not production ready
const lastItemIndex = Math.min(page.total, options.limit + options.offset);
let lastItemIndexFromThePreviousPage = page.count + page.offset;
while (lastItemIndexFromThePreviousPage < lastItemIndex) {
const remainingItemCount = lastItemIndex - lastPageItemIndex;
page = await fetchPage({ limit: remainingItemCount, offset: lastItemIndexFromThePreviousPage });
lastItemIndexFromThePreviousPage = page.count + page.offset;
yield* page.items;
}Having typed this out, I'm no longer convinced this is a better way... but feel free to get inspiration 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding pagination fields from the response. According to the documentation, they are not defined on all list endpoints. I think we can rely only on total and items. (Previously, I was under the impression that even total was optional, but re-checking the documentation, it seems it is always there.)
Here is an example of the minimal endpoint: https://docs.apify.com/api/v2/act-versions-get
(Tested and it does really return just items and total)
Therefore, I would define the algorithm in a way that does not require any optional fields from the response. But knowing I can get total allows at least some of the improvements you suggested.
I would also keep this in while condition currentPage.items.length > 0 condition, to handle any API problems or situations when the requested resources change during the iteration due to some external action. (Someone removing an actor could otherwise lead to an infinite loop)
const lastItemIndex = Math.min(page.total, options.limit + options.offset);
Since those fields are optional, this would not work for a user who defines just an offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://api.apify.com/v2/acts/.../versions is not a paginated endpoint though; it just lists all the items under items and the length of this array under total (see implementation here). See that it won't react to offset nor limit query params.
All paginated endpoints will use the Pagination class, which means those will always accept all the parameters (and return all the fields).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it in a way that can be applied to all the classes that were internally calling ResourceCollectionClient._list, even the "fake paginated endpoints".
Apply to other endpoints.
|
Some manual integration tests on all the modified endpoints was working fine on my account and one selected actor: |
asyncIterator to the StoreCollectionClient.list return valuecollectionClient.list method return also asyncIterator of relevant data
collectionClient.list method return also asyncIterator of relevant datacollectionClient.list methods return also asyncIterator of relevant data
collectionClient.list methods return also asyncIterator of relevant datacollectionClient.list methods also return asyncIterator of relevant data
collectionClient.list methods also return asyncIterator of relevant datacollectionClient.list methods return value also be asyncIterator of relevant data
barjin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I'd still like to groom the codebase a little, see my comments ⬆️
| exclusiveStartId?: string; | ||
| } | ||
|
|
||
| export interface PaginationOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking if now all the XYZOptions interfaces could extend this interface to reduce the duplicate code
see e.g.
apify-client-js/src/resource_clients/webhook_dispatch_collection.ts
Lines 36 to 40 in 7dfc5b3
| export interface WebhookDispatchCollectionListOptions { | |
| limit?: number; | |
| offset?: number; | |
| desc?: boolean; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these interfaces are wrong. For example:
export interface ActorEnvVarCollectionListOptions {
limit?: number;
offset?: number;
desc?: boolean;
}
But the endpoint does not take any parameters (tested, and also the documentation does not mention parameters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I applied it to all the options and raised an issue with the platform people about the unused options
https://apify.slack.com/archives/C01VBUV81UZ/p1764238228809389
btw.
StoreCollectionListOptions does not have desc as the only option there :-(
| list( | ||
| options: WebhookCollectionListOptions = {}, | ||
| ): Promise<PaginatedList<Omit<Webhook, 'payloadTemplate' | 'headersTemplate'>>> { | ||
| ): Promise<PaginatedList<Omit<Webhook, 'payloadTemplate' | 'headersTemplate'>>> & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could have a generic utility type, sth like
type PaginatedListIterable<T> = Promise<PaginatedList<T>> & AsyncIterable<T>to reduce the amount of code and simplify adding new endpoints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I saw IterablePaginatedList<Data> from utils.ts, why not use that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work for the normal ones, but there are these weird ones that I am not sure how to fit into such new type.
Normal one:
Promise<PaginatedList<DatasetCollectionClientListResult>> & AsyncIterable<DatasetCollectionClientListResult> == PaginatedListIterable<DatasetCollectionClientListResult>
Weird ones:
Promise<Pick<PaginatedList<ActorEnvironmentVariable>, 'total' | 'items'>> & AsyncIterable<ActorEnvironmentVariable>Promise<Pick<PaginatedList<FinalActorVersion>, 'total' | 'items'>> & AsyncIterable<FinalActorVersion>Promise<PaginatedList<Omit<KeyValueStore, 'stats'> & { username?: string }>> & AsyncIterable<KeyValueStore>Promise<PaginatedList<RequestQueue & { username?: string }> & { unnamed: boolean;}> & AsyncIterable<RequestQueue>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I applied it to the normal ones, but the special ones will stay special.
I have no strong opinion on the name.
B4nan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few questions and ideas, otherwise it looks great!
| async function* asyncGenerator() { | ||
| let currentPage = await paginatedListPromise; | ||
| yield* currentPage.items; | ||
| const offset = options.offset || 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| const offset = options.offset || 0; | |
| const offset = options.offset ?? 0; |
(not that it would matter much in here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| async list(options: ActorCollectionListOptions = {}): Promise<ActorCollectionListResult> { | ||
| list( | ||
| options: ActorCollectionListOptions = {}, | ||
| ): Promise<PaginatedList<ActorCollectionListItem>> & AsyncIterable<ActorCollectionListItem> { |
This comment was marked as duplicate.
This comment was marked as duplicate.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this thread with #790 (comment)
janbuchar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it!
barjin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm, thank you!
I noticed a few discrepancies in the (existing) types during the review, I made a separate issue (let's merge this).
|
Great stuff! How about we add some example to the main docs (not just function/class comments)? |
Yes, I think this example could be updated: https://docs.apify.com/api/client/js/docs#pagination But first, I have to explore a little how this change should be applied to list methods of |
Description
Add
ResourceCollectionClient._getIterablePaginationwhich extends the return value ofResourceCollectionClient._listbyasyncIteratorthat can be used to iterate over individual items. (It is made in a generic way and can be applied to various endpoints if desired.)Apply
_getIterablePaginationto:ActorCollectionClient.listActorEnvVarCollectionClient.listActorVersionCollectionClient.listBuildCollectionClient.listRunCollectionClient.listDatasetCollectionClient.listKeyValueStoreCollectionClient.listRequestQueueCollectionClient.listScheduleCollectionClient.listStoreCollectionClient.listTaskCollectionClient.listWebhookCollectionClient.listWebhookDispatchCollectionClient.listAdd unit tests of async iteration for the single representative class
DatasetCollectionClient.list(it works the same way on all the classes)Example usage
It can still be used the same way as before:
Or it can be used as
asyncIteratorthat can return more individual items from more than one chunk based on thelimit, andoffsetoptions and also based on the number of items returned from the API:Issues